DSCI 553 Statistical Inference and Computation II - QUIZ 1 REVIEW

Chuang Wang







Lecture 1. Overview & Probabilistic Generative Models

1. Statistical inference

Using observed data to estimate and characterize uncertainty in unobserved (or latent) quantities of interest.

These latent quantities may be real but not directly observable...

Or completely hypothetical...

Statistician's usual hammers: point estimates and confidence intervals

2. Generative models & Probabilistic generative models

Generative models

Deterministic generative model = no room for noise. Trying to incorporate noisy measurements leads to inconsistent equations. Predictions will always be the same, so the model can't account for its own simplifications (the model has to be perfect).

Probabilistic generative models

3. Generative model: math

$X_i \sim \mathrm{Bern}(p)$, $i=1, \dots, M$

It would be really hard to make a deterministic generative model for this.
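The probabilistic version, by contrast, is trivial to simulate. A minimal sketch in Python (`generate_flips` is an illustrative helper, not course code):

```python
import random

def generate_flips(p, M, seed=0):
    """Simulate M cap flips; each lands right side up (1) with probability p."""
    rng = random.Random(seed)
    return [1 if rng.random() < p else 0 for _ in range(M)]

flips = generate_flips(p=0.7, M=10)
```

A deterministic model couldn't produce this variability; here each seed yields a different plausible dataset from the same model.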

4. Generative model: code

We will use the BUGS language and JAGS software in this class to:

Let's start by generating some synthetic data for this problem.

The generative model is all you need (and all you get)

  • once you have a generative model, you can derive everything: tests, inference, etc
  • rule of thumb: if your model can generate it, it will be handled in inference
    • missing data, dependence, complex data types, etc
  • if your model can't generate it, it will not be handled

5. A Bayesian approach

Collect data: flip the cap 3 times! Then estimate $p$.

a). Standard frequentist approach.

(how do we find the maximum likelihood estimate for $p$ between 0 and 1?)

b). Bootstrap approach (without doing boatloads of math)

c). Bayesian approach

Let's try a slightly different generative model where we specify how both $p$ and $X_i$ are generated.

Pros:

6. JAGS (Bottle cap)

Data

Prior

Likelihood

Latent Variable

For now, we can just set $C = 1$. Later we'll see why more chains can be useful.

7. Activity (O-ring Failure)

1). setup

problem:

infer the probability of at least one O-ring experiencing thermal distress for the forecast launch temperature of 31F.

data:

Likelihood

$$\mathbb{P}\left(Y_i = 1\right) = \left[1+\exp\left(a (x_i - b)\right)\right]^{-1}$$
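This likelihood is easy to sanity-check numerically. A minimal sketch (`distress_prob` is an illustrative name; $a = 0.2$ and $b = 70$ are made-up values, not fitted ones):

```python
import math

def distress_prob(x, a, b):
    """P(Y_i = 1) = 1 / (1 + exp(a * (x - b))) at launch temperature x."""
    return 1.0 / (1.0 + math.exp(a * (x - b)))
```

At $x = b$ the probability is exactly 0.5, and with $a > 0$ it falls as the temperature rises, which matches the discussion of the parameters below.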

prior:

  • In the Bayesian approach to statistics, we assert that the unknown parameters $a$ and $b$ are random; to complete our model, we need to specify our prior distribution on the unknown parameters $a$ and $b$ before seeing any data.
  • Based on our engineering understanding of O-ring materials, we might be certain that failure probability decreases smoothly with temperature (so $a$ should be a small positive number), and that the transition might occur somewhere from 0F to 60F. So we could specify our prior as

What happens when $a$ is positive or negative?

  • When $a$ is positive then the distress probability decreases with temperature; when it's negative the distress probability increases with temperature.

What happens when the magnitude of $a$ is large or small?

  • Larger $a$ means the transition from high to low (or vice versa) probability happens more abruptly at a temperature of $b$.

What does $b$ control?

  • The temperature at which the distress probability is 0.5.

Latent variable:

2). Data

3). Model

Important Note: The normal distribution dnorm in JAGS is parametrized by mean and precision (1/variance), not mean and variance as usual. So even though I wrote $b \sim \mathcal{N}(70, 200)$ above, below I'll use the code b ~ dnorm(70, 5e-3) (since 5e-3 = 1 / 200).
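The conversion is just a reciprocal; a tiny sketch (`variance_to_precision` is an illustrative helper):

```python
def variance_to_precision(variance):
    """JAGS's dnorm takes (mean, precision), where precision = 1 / variance."""
    return 1.0 / variance

# b ~ N(70, 200) in math notation becomes b ~ dnorm(70, 5e-3) in JAGS code.
b_precision = variance_to_precision(200)   # 0.005
```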

2). Sample & Visualize the Prior

In order to use JAGS to take samples from the model, we need to give it the values of any known quantities. JAGS will then automatically take samples of any remaining unknown quantities.

Here, there are three quantities we can give JAGS:

Fill in the below code to create the "data" for JAGS to sample from the prior.

Hint: If we want to take samples from the prior, should we give JAGS any observations?

Run the below code to take samples from the prior.

As a summary of what happens: first, you create a pyjags.Model object, with arguments:

Then you use the .sample() function to take samples, with arguments:

Discuss the following prompts with your neighbour:

Discuss the plots that are generated. What do they show?

The first two plots show the marginal distribution of $a$ and $b$ in the prior. The 3rd plot shows samples from the prior. We can think of the prior as a distribution over curves (since a curve is defined by $a$ and $b$) and the black curves are samples from this distribution. The 4th plot is the distribution of the distress probability at 31F according to our model.

How is this different from the maximum likelihood approach?

In MLE we wouldn't have a prior at all.

3). Sample / Visualize the Posterior

Let's rerun this analysis, except this time we will include our observed data in our model. We are still going to use the logistic regression model to describe our observed launch success/failures given the parameters $a$ and $b$. But now our distribution on $a$ and $b$ will incorporate the information from our observations.

Change the JAGS data (create a dictionary named jags_data_posterior) to incorporate the observed data.

Discuss the plots that are generated. What do they show?

See answers above; the plots are the same but now showing the posterior distribution instead of the prior.

How is this different from the prior distribution?

We can see the distributions have shifted and furthermore that our marginal distributions of $a$ and $b$ now have a significantly lower variance (we are more sure of things after observing the data). This is reflected in the 3rd plot where the samples of $(a,b)$ (illustrated as curves, one curve per sample) is a lot less "messy" because we know more about $a$ and $b$. The last plot does not look encouraging at all in terms of launching a rocket on that day.

How is this different from the maximum likelihood approach?

In MLE we'd pick the most likely values of $a$ and $b$ and take these as a given. Using this $a$ and $b$ we'd then use our regression curve to get us a distress probability at the temperature we're interested in, namely 31F. Here, instead, we have a distribution over $a$ and $b$ and thus we get a distribution over distress probabilities at 31F.







Lecture 2. Bayesian Inference

1. Bayesian inference

Generative model: I'm going to flip two bottlecaps, with $X, Y \in \left\{0, 1\right\}$. They have probability of landing right side up (value 1) $p = 0.7$.

  • if the 1st one lands right side up ($X = 1$), I will cheat and force the 2nd one to be right side up ($Y = 1$)
  • if the 1st one lands upside down ($X = 0$), I will flip the 2nd one fairly $Y \in \left\{0, 1\right\}$.

use observed $Y=1$ to infer $X$

  • We will use the conditional distribution of $X$ given $Y$!
  • We want to know $P(X = k | Y = 1)$ for each $k \in \{0, 1\}$.

2. Bayes' rule

To compute $P\left(X = k | Y = 1\right)$, we need a rule from probability theory:

The law of conditioning: If $X$ and $Y$ take values in $\left\{1, 2, \dots, K\right\}$, then $P\left(X = k, Y = j\right) = P\left(X = k | Y = j\right)P\left(Y = j\right)$

Using this rule twice, we have

$P(X = k | Y = 1) = \frac{P(X = k, Y = 1)}{P(Y = 1)} = \frac{P(Y = 1 | X = k) P(X = k)}{P(Y = 1)} \propto P(Y = 1 | X = k) P(X = k)$

Bayes' Rule: The posterior is proportional to the likelihood times the prior
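For the two-bottlecap example, Bayes' rule can be checked with a few lines of arithmetic. A minimal sketch (assuming, per the generative model, that the fairly flipped second cap still lands right side up with probability 0.7):

```python
# Two-bottlecap model with p = 0.7: the second cap is forced up if X = 1,
# and flipped fairly (still landing up with probability 0.7) if X = 0.
prior = {1: 0.7, 0: 0.3}       # P(X = k)
likelihood = {1: 1.0, 0: 0.7}  # P(Y = 1 | X = k)

# Posterior is proportional to likelihood * prior; normalize over k.
unnorm = {k: likelihood[k] * prior[k] for k in prior}
evidence = sum(unnorm.values())                     # P(Y = 1) = 0.91
posterior = {k: v / evidence for k, v in unnorm.items()}
```

So observing $Y = 1$ pushes us from the prior $P(X=1) = 0.7$ up to the posterior $P(X=1 \mid Y=1) = 0.7/0.91 \approx 0.77$.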

3. Posterior distribution

The posterior distribution is ... well, a distribution! So you can use it for

Additional (not covered):

Our description of uncertainty is intuitive; it's a probability distribution over the parameters we don't know!

4. JAGS (Bottle Caps)

1). Setup

Problem

Data

Likelihood

Single toss likelihood: $P(X_n = x_n | p) = p^{x_n}(1-p)^{1-x_n}$, then for all data

Prior

Mixture Distributions

  • Build a mixture of two betas! We add an auxiliary variable $z \in \{0, 1\}$:

$z \sim \mathrm{Bern}(1/2)$

We can do this concisely with

$$p \sim \mathrm{Beta}(7z + 2(1-z), 2z + 7(1-z))$$

(now we have two latent variables $p$ and $z$, and we need to infer both! But we only care about the marginal posterior on $p$)

Let's plot the density of our mixture (sum of weighted individual densities)
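A minimal sketch of that density in pure Python (`beta_pdf` and `mixture_pdf` are illustrative names; the normalizer uses log-gamma for numerical stability):

```python
import math

def beta_pdf(p, alpha, beta):
    """Beta(alpha, beta) density on (0, 1); normalizer computed via log-gamma."""
    log_norm = math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
    return math.exp(log_norm + (alpha - 1) * math.log(p) + (beta - 1) * math.log(1 - p))

def mixture_pdf(p):
    """Equal-weight mixture of Beta(7, 2) and Beta(2, 7), matching z ~ Bern(1/2)."""
    return 0.5 * beta_pdf(p, 7, 2) + 0.5 * beta_pdf(p, 2, 7)
```

By symmetry of the two components, the mixture density is symmetric about $p = 1/2$, and it still integrates to 1.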

Reference: Beta distribution

The beta distribution $\mathrm{Beta}(\alpha, \beta)$ is a distribution on [0, 1]

Density: $f(p) \propto p^{\alpha - 1}(1-p)^{\beta - 1}$

Does this fit our subjective understanding of our uncertainty in $p$?

Latent Variable

2). Code

3). Posterior point estimates

We often want to summarize our posterior distribution with one representative value

Mean: the conditional expectation of latent variables given data: $\mu = \mathbb{E}\left[p | X\right]$

Median: the value $m$ such that $P(p \leq m | X) = P(p > m | X) = 0.5$

Mode (Maximum a posteriori, MAP): the value $p^\star = \arg\!\max_p f_{p | X}(p)$

4). Posterior uncertainty

Variance: $\sigma^2 = \mathbb{Var}\left[p | X\right] = \mathbb{E}\left[p^2 | X\right] - \mathbb{E}\left[p | X\right]^2$

$k$% credible interval: any interval $I = [a, b]$ such that $P(a \leq p \leq b | X) = \frac{k}{100}$.
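Given posterior samples, a central credible interval is just a pair of empirical quantiles. A minimal sketch (`credible_interval` is an illustrative helper; in practice numpy.percentile does the same job):

```python
def credible_interval(samples, k=90):
    """Central k% credible interval from posterior samples (empirical quantiles)."""
    s = sorted(samples)
    tail = (100 - k) / 200                    # e.g. 0.05 in each tail for k = 90
    lo = s[round(tail * (len(s) - 1))]
    hi = s[round((1 - tail) * (len(s) - 1))]
    return lo, hi
```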

5000 samples

5). Monte Carlo

Monte Carlo is a super broad term that basically refers to simulating things, and usually computing expectations.

Discuss with your neighbour:

JAGS generates samples $(p_s)_{s=1}^S$ from the posterior (more on this next week)

Then we estimate any posterior expectation using Monte Carlo:

$\mathbb{E}\left[f(p) | X\right] \approx \frac{1}{S}\sum_{s=1}^S f(p_s)$
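A minimal sketch of this estimator, using Python's built-in Beta sampler as a stand-in for JAGS output (`mc_expectation` is an illustrative name):

```python
import random

def mc_expectation(f, samples):
    """Monte Carlo estimate of E[f(p) | X]: average f over the posterior samples."""
    return sum(f(p) for p in samples) / len(samples)

# Stand-in for JAGS output: 20,000 draws from a Beta(2, 2) "posterior".
rng = random.Random(1)
samples = [rng.betavariate(2, 2) for _ in range(20000)]
est_mean = mc_expectation(lambda p: p, samples)  # true value is 0.5
```

The same one-liner gives the variance, credible-interval endpoints via quantiles, or any other posterior summary, just by swapping in a different $f$.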

1000 samples

5. Activity (CO2)

Your task is to use the Mauna Loa CO2 dataset to infer:

  1. how quickly atmospheric CO2 is increasing yearly
  2. whether or not the increase is accelerating

1). Setup

We will use a Bayesian statistical model for this data known as Bayesian linear regression. The model includes 3 parameters/latent variables.

data

Likelihood

Latent Variable

Prior

Latent Variable

2). Data

3). Model

4). Sample / Visualize the Prior

Is this a reasonable prior distribution? Why or why not?

It seems OK, although a lot of the curves go up or down a huge amount. Perhaps a tighter prior on $a$ and/or $c$ might have kept things more in check, if one had the scientific knowledge to pin them down a bit more. In a way I'm cheating a bit because I'm looking at the data when I say this, whereas we're supposed to pick the prior before seeing the data.

5). Sample / Visualize the Posterior

How does this differ from the prior?

The marginal variances of our parameters are much smaller now - in other words, we've "pinned down" the parameters decently well with our data. The curves (samples from the posterior this time) are much less wild this time.

The samples from the posterior in the top plot look like a pretty good fit to the general trend, yes. If you wanted to model the oscillations you'd need to build that into the model as well.

How many ppm does the CO2 concentration tend to increase by each year?

I guess this is asking about $a$, which is the rate of increase in 1974 (the rate of increase goes up later). So, around 1.2 ppm/year.

By how many (ppm/year) does that rate itself increase each year?

This is $c$. So, I guess around 0.013 ppm/year/year.

6). Posterior Point Estimates / Intervals

Your final task is to use the samples to produce a posterior median point estimate as well as posterior 90% credible intervals for each variable in the model. Relative to their means, are the credible intervals large or small? Discuss with your neighbour.

Relative to their means, are the credible intervals large or small?

Pretty small I'd say. For $a$ the mean is around $1.16$ and the interval size is around $0.02$. For $b$ it's even smaller. For $c$ the mean is $0.013$ and the interval size is $0.0005$, so also small.







Lecture 3. Bayesian Model Design

1. Steps

Step 1. Scientific question

Step 2. Design

formulate variables and create a probabilistic model for them

  1. pick a likelihood distribution such that

    likelihood is the conditional density of the data given the parameters.

    • again respect type and support:

      | support | cannot pick | could pick |
      |--|--|--|
      | integers | normal distribution | Poisson, negative binomial |
      | positive numbers | normal distribution | Gamma, exponential, log-normal |
      | probability (0-1) | normal distribution, etc. | Beta |
  2. has enough parameters to help you answer the scientific questions (for example, a variance parameter if you want to ask about the variance)
  3. Pick a prior distribution for the parameters in your likelihood distribution
    • need to pick a prior distribution for all unknown parameters
    • prior distribution needs to reflect your uncertainty
    • prior distribution must have right support

      for example, you cannot pick $\lambda \sim \mathcal{N}(0, 1)$ for $\lambda$ in $\mathrm{Poisson}(\lambda)$, since $\lambda$ must be positive

    • typical technique
      • pick a family that respects your support (e.g. normal, gamma)
      • pick hyperparameters to capture the uncertainty within that family.

Step 3. Infer

get "posterior samples" from the conditional distribution of any variable you want to infer given your observed data

Step 4. Prediction

Prediction of $y_i$ has two sources of randomness/uncertainties.

  1. $y_i$ is random from the distribution. (Built-in uncertainty from the likelihood)
  2. The parameter theta of the distribution is also a random variable from another distribution. (The prior of the parameter)
  3. Lecture note

Step 5. Check

make sure the "samples" are actually from your posterior (covered later!)

Step 6. Analyze

Use your samples to compute summaries: mean, MAP, variance, credible intervals, etc

Points to consider when building a model

2. Bike share

1). Setup

Question:

Data

Covariates

These are covariates that will modify the likelihood of our observations $N_i$ (# trips)

  • e.g. in regression, you have inputs $x$ and outputs $y$; the inputs are covariates
  • here we expect more trips when it's warmer / weather is nicer, so $w_i$ and $t_i$ should influence $N_i$
  • weather situation $w_i \in \{1, 2, 3, 4\}$ (categorical) - (note: weathersit = 1 (clear/partly cloudy), 2 (misty/cloudy), 3 (precipitation))
  • temperature $t_i \in \mathbb{R}$ (numerical)

Likelihood

Cool! This created some unknown parameters: for each $w\in \{1, 2, 3\}$,

  • $\alpha_w$ is the slope of mean trips
  • $\ell_w$ is the "elbow location" where riders start to take trips

Prior

What value for the hyperparameters $\mu, \sigma^2, k, \theta$?
Make them reflect uncertainty!

  • $\mu = 0$, $\sigma^2 = 15^2 = 225$
  • $k = 1, \theta = 1$

Latent variables

7 in total. $\alpha_w$ and $\ell_w$ for each of 3 weather conditions = 6. Those are all continuous. There's also $N$. It is discrete (integer).

2). Data

3). Model

4). Posterior samples

3. Activity (Ranking Baseball Teams)

1). Setup

TASK:

Data

Latent Variable

Likelihood

  • (score diff)_i ~ Poisson(mean_i)
  • where mean_i = $\log(1+\exp(w - \ell))$. Here is the reasoning:

How do we decide what function to use? Let's think about how team skill relates to the Poisson mean:

  • If a team is much more skilled than another, is the score difference likely to be higher or lower?
  • Can the Poisson mean be negative?

If the skill value for winning team is $w$ and losing team is $\ell$, and the observed score difference is $d$ then an appropriate likelihood could be

$$\text{score\_diff} \sim \mathrm{Poiss}(\log(1+\exp(w - \ell)))$$

but there are many other options. As long as the Poisson mean increases with $w-\ell$ and is always positive, it's a decent choice.
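A tiny sketch of this softplus link (`poisson_mean` is an illustrative name):

```python
import math

def poisson_mean(w, l):
    """Softplus link log(1 + exp(w - l)): always positive, increasing in w - l."""
    return math.log1p(math.exp(w - l))
```

For evenly matched teams ($w = \ell$) the expected score difference is $\log 2 \approx 0.69$, and it grows as the skill gap widens.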

Prior

Now that we have a likelihood for our data, we need to pick a prior distribution for each of our unknown skill values $S_t$, $t=1,\dots, T$.

  • should the prior make each $S_t$ independent? Or dependent?
  • since $S_t\geq 0$, what distribution can we use as a prior?

Many good choices here. Either Gamma or Exponential have the right support (nonnegative reals).

We went with Gamma(1,1) below.

2). Data

wrangle data

We will now wrangle this data into a format that is more useful for our ranking analysis. There are a lot of steps below; feel free to read through the comments above each line. There are three important outputs from this code:

3). Code the model

4). Sample / Visualize the Posterior

Run the below code to use JAGS to take 1,000 samples of all the skill values for all the teams given the observed game score differences.

With these samples, we can create many different visualizations of the posterior distribution. For example, let's start by plotting 50% credible intervals around the median skill for each team.

Use the code below to create a plot of skill credible intervals. Discuss with your neighbour: which teams have the highest skill? How uncertain is the ranking at the top, middle, and bottom of the pack?

Which teams have the highest skill?

chn (Chicago Cubs) and bos (Boston Red Sox).

How uncertain is the ranking at the top, middle, and bottom of the pack?

It is fairly clear which teams have the highest skill, and relatively clear which have the lowest skill. But in the middle a lot of the credible intervals are overlapping, so the ranking is quite uncertain in the middle of the pack.

5). Compute Rankings from Skill

OK, the Cubs chn and the Sox bos are pretty close in terms of posterior skill level distribution. What about ranking?

Complete the below code to

  1. Compute the ranking for the Chicago Cubs (chn), Boston Red Sox (bos), and Toronto Blue Jays (tor) in each posterior sample by finding the ranking of their skill values
  2. Create visualizations that show the marginal posterior distributions of ranks

Your final job is to think about the following problem, and discuss with your neighbour.

Major League Baseball teams are split into two groups: the very creatively named National and American Leagues. And within these groups, they are subdivided further into East, Central, and West divisions. Teams in each one of these subdivisions play each other the most; they play across subdivision slightly less; and they play across league the least.

As it so happens, the Chicago Cubs and Boston Red Sox are in different leagues.

What problems would this cause in the model you designed above? If there are indeed problems caused by this, how might you fix them by redesigning the model?

What comes to my (Mike's) mind is that perhaps one team is in a much "easier" league than the other, so it wins a lot of its games with huge score differences just for that reason. This could inflate its skill rating. Perhaps you could normalize by the number of games played, like the average score difference between the two teams, or something like that.







Lecture 4. Computational Bayesian Inference

Mike's comment from slack:
Monte Carlo is a super broad term that basically refers to simulating things, and usually computing expectations. Markov chain Monte Carlo (MCMC) is a general class of algorithms for sampling, generally used for getting samples from the posterior in a Bayesian model. JAGS stands for Just Another Gibbs Sampler, referring to a particular MCMC algorithm called Gibbs Sampling.

1. Markov chains

A Markov chain is a random sequence in a parameter space $(\theta_0, \theta_1, \theta_2, \dots)$

If we know the current position $\theta_t$, then the distribution of the next position $\theta_{t+1}$ doesn't depend on the past positions $\theta_{1:t-1}$.

Example: $\theta_{t+1} \sim \mathcal{N}(\theta_t, \frac{1}{10}I)$ (say, in 2D)

Let's say I start at $\theta_0 = [10,10]$ and simulate $T$ steps. What happens to the distribution of $\theta_T$ as $T$ gets large?

The variance will grow; the mean will stay the same.

Discuss at your table.

A small change...

This is not particularly useful as-is; the distribution just kind of "blows up" over time.

But what happens if I make a small modification to our random walk?

$\theta_{t+1} \sim \mathcal{N}\left(\frac{9}{10}\theta_t, \frac{19}{100} I\right)$
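Why this modified chain no longer blows up can be seen from the moment recursions alone: per coordinate, $\mathrm{Var}(\theta_{t+1}) = \frac{81}{100}\mathrm{Var}(\theta_t) + \frac{19}{100}$, which has fixed point 1, while the mean decays by a factor $\frac{9}{10}$ each step. A deterministic sketch of those recursions:

```python
# Moment recursions for theta_{t+1} ~ N(0.9 * theta_t, 0.19) in each coordinate.
mean, var = 10.0, 0.0        # start at the deterministic point theta_0 = 10
for _ in range(200):
    mean = 0.9 * mean        # mean decays geometrically to 0
    var = 0.81 * var + 0.19  # variance converges to the fixed point 1
```

So regardless of the starting point, the distribution of $\theta_T$ settles down to (in each coordinate) a $\mathcal{N}(0, 1)$ target as $T$ grows.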

2. Markov chain Monte Carlo (MCMC)

Design a random sequence $(\theta_t)_{t=1}^T$ such that the distribution of $\theta_T$ converges to our target distribution as $T \to \infty$.

Then treat $\theta_T$ as if it were a real sample from our target distribution.

We can't actually make $T$ infinity... but as long as $T$ is large enough, $\theta_T$ should be a good approximate sample from the distribution!

Algorithm (this is what JAGS does):

It turns out this is almost always possible for Bayesian posteriors. The Optional Sections below contain extra material if you want to learn more about MCMC. JAGS uses a very general-purpose method for constructing this sequence; Stan (another package) uses a different one. Many exist -- there are entire research fields devoted to these algorithms!

3. MCMC Diagnostics

Gelman-Rubin Convergence Diagnostic

As we saw in lecture, there are lots of potential pitfalls in MCMC (bad step size, not enough thinning, not enough burn-in). So how do we know if it worked and is giving us "good" samples from our true posterior distribution? Thankfully, there are MCMC convergence diagnostics to give us a sense for whether we can trust the output of our MCMC algorithm.

In this activity, we'll use the Gelman-Rubin diagnostic [1] for this purpose. The intuition is that if our MCMC algorithm is working well and we run multiple copies of our chain, then they should all give us roughly the same answer. In particular, we check that quantities related to the variance within each chain and the variance between chains are roughly the same.

The formula for the Gelman-Rubin diagnostic is as follows. Let $N > 1$ be the number of samples from each chain, and let $C > 1$ be the number of chains. Let $X_{cn}$ be the $n^\text{th}$ sample from the $c^\text{th}$ chain.

The sample mean and variance from chain $c$ are $$ \mu_c = \frac{1}{N}\sum_{n=1}^N X_{cn}\qquad s_c^2= \frac{1}{N}\sum_{n=1}^N \left(X_{cn} - \mu_c\right)^2$$

The average within-chain variance is:

$$s^2 = \frac{1}{C}\sum_{c=1}^C s_c^2$$

The overall mean of all samples from all chains is $$ \mu = \frac{1}{C}\sum_{c=1}^C \mu_c$$

The average between-chain variance is:

$$b^2 = \frac{1}{C-1}\sum_{c=1}^C \left(\mu_c - \mu\right)^2$$

Then if the MCMC chains are working well, the Gelman-Rubin diagnostic $\hat R$ is approximately 1, i.e.

$$\hat R = \sqrt{1 +\frac{b^2}{s^2}} \approx 1$$

Typically we say that the chains are working reasonably well if $\hat R \leq 1.1$.

Note: The Gelman/Rubin diagnostic sometimes doesn't make sense for discrete variables (where "mean" and "variance" may not be meaningful). There are other diagnostics out there, but we won't cover them in this class.

[1] Gelman and Rubin (1992). "Inference from iterative simulation using multiple sequences (with discussion)." Statistical Science 7:457-472.
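The formulas above translate directly into code. A minimal sketch for a single scalar variable (`gelman_rubin` is an illustrative name; chains is a list of C lists of N samples each):

```python
import math

def gelman_rubin(chains):
    """R-hat per the formulas above; chains is a list of C lists of N samples."""
    C, N = len(chains), len(chains[0])
    mus = [sum(ch) / N for ch in chains]                      # per-chain means
    s2s = [sum((x - m) ** 2 for x in ch) / N                  # per-chain variances
           for ch, m in zip(chains, mus)]
    s2 = sum(s2s) / C                                         # within-chain variance
    mu = sum(mus) / C                                         # overall mean
    b2 = sum((m - mu) ** 2 for m in mus) / (C - 1)            # between-chain variance
    return math.sqrt(1 + b2 / s2)
```

Chains that agree perfectly give $\hat R = 1$ exactly (the between-chain variance is 0), while chains stuck in different regions push $\hat R$ well above the 1.1 threshold.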

4. Lecture notes

5. Activity (Ames, Iowa Housing Data)

Your task is to perform Bayesian posterior inference on a relatively simple regression model (like the one we worked on in Lecture 2) relating house sale price to lot area, with the goal of understanding the inner workings of Markov chain Monte Carlo (MCMC) algorithms and how we can evaluate how well they work.

1). Setup

The following model will be used to investigate Markov chain Monte Carlo methods:

2). Data

3). Code the model

4). Visualize Posterior samples

Use the below code to sample from the posterior.

Note: Now is when we'll set # chains to be greater than 1 (below, I chose 5). We need to do this to help us evaluate the quality of the samples from JAGS. If we set chains = C, then JAGS will just repeat the same process C times (with C different results!).

The output post_samps will have shape (C x N x D), where C is the # chains, N is the # samples, and D is 3 (for a, b, and r).

5). Interpretation

According to our posterior the value of $r$ is relatively small - very unlikely to be below $0.1$. In the plot of the joint distribution of $a$ and $b$, we can see a negative correlation. This makes sense since if the slope is larger (larger $a$) then you'd need to shift your regression curve down (smaller $b$) for it to still pass through the cloud of points (and vice versa). It would also be interesting to look at the joint distributions of $(a,r)$ and $(b,r)$ according to the posterior:

6). Markov Chain Monte Carlo (MCMC)

That pesky nonlinearity variable $r$ means we can't evaluate the exact posterior easily. Generally speaking, sampling from distributions is quite difficult. When the distribution has a form we recognize (Gaussian, gamma, exponential, Poisson, etc...), it's easy; but Bayesian posteriors are rarely of a "nice" form like that. So then how does JAGS take samples from the posterior distribution?

As it turns out, it's actually not directly taking samples from the posterior distribution. Instead, it is simulating a random process called a Markov chain that explores your parameter space in a very carefully designed way. In particular, it is designed so that you can:

  1. Start the process at whatever parameter value you want
  2. Close your eyes
  3. Let the process run for a long time, exploring (randomly!) through your parameter space
  4. After the process has run for a long time, you are uncertain about where it is (it's random!). As it turns out, that unknown location has almost the same distribution as a sample from the posterior. So,
  5. You open your eyes, record the location of the process as a sample from your posterior, and return to 2.

Since we simulate a Markov chain to approximate the posterior using samples (i.e., Monte Carlo), it's called Markov Chain Monte Carlo (MCMC).

7). Gelman-Rubin

Code the formula to compute the diagnostic for each variable in our sample. Remember that our samples are a numpy array with shape C x N x 3, where C is the number of chains, N is the number of samples, and the 3-size axis holds the (a, b, r) values. So the function below should return a size 3 array (one $\hat R$ value for each variable).

Use your code to compute the Gelman-Rubin diagnostic for the 3 posterior chains. Does the diagnostic convince you that JAGS worked well? Discuss with your neighbour.

8). Increase num of chains

Now increase the number of chains that you use when running MCMC to 20, and rerun the Gelman-Rubin diagnostic. Did the answer change?

The result changed a bit, not much.







Practice quiz 1

Question 1

Consider the following JAGS code

X ~ dnorm(mu, 0.01)
sigma <- 10
mu ~ dnorm(0, 1/sigma^2)

Which line is the prior?

Answer: the line mu ~ dnorm(0, 1/sigma^2), i.e. mu ~ dnorm(0, 0.01) since sigma = 10.

Question 2

[Continuing with the model from the previous question]

True or False: decreasing sigma could, for certain values of X, cause the mean of the posterior distribution to decrease.

Answer: True (Note: in hindsight this question was too difficult)

Question 3

[Continuing with the model from the previous question]

True or False: increasing sigma could, for certain values of X, cause the mean of the posterior distribution to decrease.

Answer: True (Note: in hindsight this question was too difficult)


Q2&Q3

sigma controls how certain your prior is

Since X ~ dnorm(mu, 0.01), we expect the posterior mean to lie somewhere between 0 (the prior mean) and X (the observed sample),

and the prior variance / likelihood variance controls where the posterior mean will be:

Answer:

  • Q2: let's say X = 1. If I start decreasing the prior variance (i.e., making the prior more and more certain around 0) then my posterior mean will move towards 0 (decrease)
  • Q3: on the other hand, let's say X = -1. Then if I start increasing the prior variance (i.e. making the prior less and less certain around 0) then my posterior mean will shift downwards (decrease) towards X= -1
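Both answers follow from the standard normal-normal conjugacy result: the posterior mean is a precision-weighted average of the prior mean (0) and the observation X. A minimal sketch (`posterior_mean` is an illustrative name; the likelihood precision 0.01 comes from X ~ dnorm(mu, 0.01)):

```python
def posterior_mean(X, sigma):
    """Posterior mean of mu given prior mu ~ N(0, sigma^2) and X ~ N(mu, var=100).

    Standard normal-normal conjugacy: a precision-weighted average of the
    prior mean (0) and the data X.
    """
    prec_prior = 1.0 / sigma ** 2
    prec_lik = 0.01                 # dnorm(mu, 0.01) has precision 0.01
    return (prec_lik * X) / (prec_prior + prec_lik)
```

With X = 1, shrinking sigma from 10 to 1 pulls the posterior mean from 0.5 toward 0 (Q2); with X = -1, growing sigma from 10 to 100 pulls it from -0.5 toward -1 (Q3).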

EX1b in lab2

why is the credible interval/variance using the Beta(1,10) prior wider/bigger than the credible interval & variance using the Beta(1,1) prior, even though the variance of Beta(1,10) is smaller than that of Beta(1,1)?

Answer:

  • This is because the Beta(1,10) prior favors smaller values of p, whereas the data favors larger values of p. So they disagree with each other, resulting in more uncertainty. (Prior drags to the smaller values of p, and likelihood drags to the bigger values of p)
  • If you try a Beta(10,1) prior, which has the exact same variance as Beta(1,10), you'll see a smaller credible interval (both the prior and the likelihood favor larger values of p).

Conclusion

  • With less data (i.e., when the likelihood variance is very large relative to the prior variance), the prior has much more of an effect on the posterior distribution: the mean shifts most of the way toward the value specified in the prior.


    Question 4

    For the helicopter ride problem from Lab 2, how many latent variables are there? What are they? Are they discrete or continuous?

    Answer: One: which AI software the company is using. It's discrete.

    Question 5

    In the bike share problem from Lecture 3, how many latent variables are there? What are they? Are they discrete or continuous?

    Answer: 7 in total. alpha and L for each of 3 weather conditions = 6. Those are all continuous. There's also N_new. It is discrete (integer).

    Question 6

    Which of the following can be obtained from posterior samples (e.g. from JAGS)? Select all that apply.

    Answer:

    • Posterior mean
    • Posterior median
    • Credible intervals
    • MAP is a bit ambiguous; one could try to get the MAP but it's messy with samples from a continuous distribution

    Question 7

    Bayesian inference V.S. MLE/MAP (lab2)

    Answer:

    • In MLE and MAP we take our best guess of the AI as the truth when we compute the expected utility in the subsequent step; we therefore ignore our uncertainty over which AI the company uses. Since dflyr happens to be the best among the three, the (conditional) expected utility in 2(b) and (d) are overly optimistic.
    • With the Bayesian approach, we have a natural way to incorporate our uncertainty of which AI the company actually uses, via the posterior probabilities. Here, the (posterior) expected utility quantifies the level of enjoyment we expect to have from a helicopter ride given that we are presented with the flight records. This allows us to take into account the substantial chance (~38%) that the AI is actually Deep Blue Sky which has quite a low expected utility of about -900, for an overall expected utility of around $-287$. This is well below zero so no, I would not ride the helicopter.

    Question 8

    When you call model.sample in pyjags, you can pass in a thinning parameter called thin. This keeps only every thin-th sample and discards the rest. What is the disadvantage of using thinning (i.e. thin > 1)?

    Answer: It's slower: thinning discards samples, so you have to run more MCMC iterations to end up with the same number of retained samples.

    Question 9

    Answer True (T) or False (F)

    1. The influence of the prior distribution is reduced as the number of data points increases;
    2. If we have enough data, the posterior distribution evaluated at the true parameter value will be non-zero even if the prior distribution is zero there;
    3. In Bayesian inference, the posterior distribution encodes all of our beliefs about our latent variables(s) after observing data;
    4. Increasing the amount of observed data tends to result in a narrower posterior distribution.
    5. Increasing the number of JAGS samples tends to result in a narrower posterior distribution.

    Answer:

    1. T
    2. F
    3. T
    4. T
    5. F






    Practice quiz 2

    Question 1

    Identify the prior distribution, the posterior distribution and the likelihood function in the equation below.

    Answer:

    • P(theta|data) is the posterior distribution;
    • P(data| theta) is the likelihood function;
    • P(theta) is the prior distribution;

    Question 2

    Briefly (really!), contrast the prior distribution with the posterior distribution. What do they represent? How are they different?

    Answer:

    • The prior distribution represents your prior beliefs about the parameters of interest
    • while your posterior distribution is your prior beliefs updated by the evidence contained in the data.

    Question 3

    MCMC is used to sample from which distribution in the Bayesian Framework?

    Answer: To sample from the posterior distribution.

    Question 4

    What is the burn-in period in MCMC?

    Answer: It is the initial period before the chain has converged; samples generated during this period are not yet draws from the posterior distribution.
    Burn-in is intended to give the Markov chain time to reach its equilibrium distribution, particularly if it started from a lousy starting point. To "burn in" a chain, you simply discard the first $n$ samples before you start collecting points.

    Question 5

    Contrast the effects of increasing the number of simulated samples in MCMC versus increasing the amount of available data.

    Answer:

    • By increasing the number of simulated samples in MCMC, we get a better (smoother) approximation of the posterior distribution, but the posterior itself does not change.
    • By increasing the amount of observed data, on the other hand, we actually decrease the variance of (narrow) the posterior distribution.
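    An illustrative check using scipy's closed-form Beta-Binomial posterior as a stand-in for running JAGS (the fixed observed proportion of 0.6 and the sample sizes are made up):

```python
import numpy as np
from scipy.stats import beta

# Conjugate Beta-Binomial posterior: Uniform(0,1) prior plus s successes
# in n trials gives Beta(s + 1, n - s + 1).
def posterior(n, prop=0.6):
    s = int(prop * n)
    return beta(s + 1, n - s + 1)

# More *data* -> narrower posterior:
print(posterior(10).std())    # wide
print(posterior(1000).std())  # much narrower

# More *MCMC samples* from the SAME posterior -> same width,
# just a smoother, less noisy estimate of it:
rng = np.random.default_rng(0)
post = posterior(100)
print(np.std(post.rvs(500, random_state=rng)))
print(np.std(post.rvs(50_000, random_state=rng)))  # about the same as above
```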

    Question 6

    You've done a Bayesian analysis, and have found that a 95% credible interval for the parameter p is (0.4, 0.6). Provide an interpretation of this credible interval.

    Answer: The posterior probability that p lies between 0.4 and 0.6 is 95%.
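    For instance, with a hypothetical Beta(30, 30) posterior for $p$, scipy gives an equal-tailed 95% credible interval, and we can confirm it contains 95% of the posterior mass:

```python
from scipy.stats import beta

# Hypothetical posterior for p (peaked near 0.5).
post = beta(30, 30)

# Equal-tailed 95% credible interval: the central 95% of posterior mass.
lo, hi = post.ppf(0.025), post.ppf(0.975)
print(f"95% credible interval: ({lo:.2f}, {hi:.2f})")

# By construction, the posterior probability of this interval is 0.95:
print(post.cdf(hi) - post.cdf(lo))
```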

    Question 7

    For a Binomial generative model, the "obvious" estimator of the underlying success probability $p$ is the ratio of number of successes observed to the number of trials.

    Consider instead the following Bayesian analysis: we put a Uniform(0,1) prior on $p$, and then given our observations we compute the posterior and take its mean as our estimate of $p$.

    Does this approach in general yield the same estimate as the simpler approach described above? Explain why or why not.

    Answer:

    • The Uniform(0, 1) distribution is equivalent to a Beta(1, 1) distribution. Therefore, by Beta-Binomial conjugacy, the posterior distribution is Beta(#success + 1, #failure + 1). The posterior mean is $\frac{\text{\#success} + 1}{\text{\#trial} + 2}$, which is not the same as the sample proportion, $\hat{p} = \frac{\text{\#success}}{\text{\#trial}}$. The uniform prior can be interpreted as contributing two artificial data points a priori: one success and one failure. Therefore, the posterior mean is shifted towards 0.5.
    • Extending this to a Beta($a$, $b$) prior, the prior information can be interpreted as having seen $a$ successes and $b$ failures before the observations are collected in the current study.
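    A quick check with made-up data (7 successes in 10 trials):

```python
from fractions import Fraction

s, n = 7, 10                        # hypothetical data: 7 successes in 10 trials

mle = Fraction(s, n)                # "obvious" estimator: #success / #trial
post_mean = Fraction(s + 1, n + 2)  # posterior mean under a Uniform(0,1) prior

print(mle)        # 7/10
print(post_mean)  # 2/3

# The two estimates differ, and the posterior mean is shrunk toward 1/2:
assert post_mean != mle
assert abs(post_mean - Fraction(1, 2)) < abs(mle - Fraction(1, 2))
```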

    Question 8

    What two sources does the uncertainty of the posterior come from?

    Answer: The prior and the likelihood, i.e., uncertainty in our prior beliefs and the noise/limited information in the observed data.

    Question 9

    If you choose a smaller and smaller variance for the prior, does this mean the data have more or less influence on the posterior? Why?

    Answer: Less influence. A smaller prior variance means a more confident (more informative) prior, so the posterior is pulled more strongly toward the prior mean and the data are weighted less heavily. (Relatedly, smaller variance in either the prior or the likelihood leads to smaller variance of the posterior.)
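    A sketch with the conjugate normal-normal model (all numbers hypothetical) showing the precision-weighted trade-off between prior and data:

```python
# Conjugate normal model: x_i ~ N(theta, sigma2) with prior theta ~ N(m0, tau2).
# The posterior mean is a precision-weighted average of prior mean and data mean.
def posterior_mean(xbar, n, sigma2, m0, tau2):
    prec_prior = 1 / tau2        # prior precision
    prec_data = n / sigma2       # data precision
    return (prec_prior * m0 + prec_data * xbar) / (prec_prior + prec_data)

xbar, n, sigma2, m0 = 10.0, 5, 1.0, 0.0

# Shrinking the prior variance drags the posterior mean toward the prior mean
# of 0, away from the data mean of 10:
for tau2 in (100.0, 1.0, 0.01):
    print(tau2, posterior_mean(xbar, n, sigma2, m0, tau2))
```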

    Question 10

    Unlike in frequentist statistics, in Bayesian statistics we need to specify a prior distribution over our parameters. Name one advantage and one disadvantage of needing to specify a prior distribution.

    Answer:

    • Advantage: Bayesian statistics allows us to use prior knowledge, instead of just relying on the data.
    • Disadvantage: Bayesian statistics forces us to use prior knowledge, instead of just relying on the data, which may lead to the analysis being too subjective. Another disadvantage is that Bayesian statistics is typically more computationally expensive.






    Appendix

    1. Distributions

    Bernoulli Distribution

    Normal Distribution

    Log Normal Distribution

    Poisson Distribution

    Binomial Distribution

    Negative Binomial Distribution

    Exponential Distribution

    Gamma Distribution

    Beta Distribution
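    All of these distributions are available in scipy.stats; a few sanity checks on their parameterizations (note that scipy uses scale parameters where JAGS's dexp and dgamma use rates, so convert with scale = 1/rate):

```python
from scipy import stats

# Quick sanity checks on scipy.stats counterparts of the distributions above.
assert abs(stats.bernoulli(0.3).pmf(1) - 0.3) < 1e-12      # P(X = 1) = p
assert abs(stats.norm(0, 1).cdf(0) - 0.5) < 1e-12          # symmetric about 0
assert abs(stats.lognorm(s=1).median() - 1.0) < 1e-6       # median = exp(mu) = 1
assert abs(stats.poisson(4).mean() - 4) < 1e-12            # mean = rate
assert abs(stats.binom(n=10, p=0.5).mean() - 5) < 1e-12    # mean = n * p
assert abs(stats.nbinom(n=5, p=0.5).mean() - 5) < 1e-12    # mean = n(1-p)/p
assert abs(stats.expon(scale=2).mean() - 2) < 1e-12        # scale = 1/rate
assert abs(stats.gamma(a=3, scale=2).mean() - 6) < 1e-12   # mean = shape * scale
assert abs(stats.beta(2, 2).mean() - 0.5) < 1e-12          # mean = a / (a + b)
print("all distribution sanity checks passed")
```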